Abstract

With the publication of the first eukaryotic species description, combining transcriptomic, DNA barcoding, and 
micro-CT imaging data, GigaScience and Pensoft demonstrate how classical taxonomic description of a new species 
can be enhanced by applying new generation molecular methods, and novel computing and imaging 
technologies. This 'holistic' approach in taxonomic description of a new species of cave-dwelling centipede is 
published in the Biodiversity Data Journal (BDJ), with coordinated data release in the GigaScience GigaDB database. 



Background 

The challenge 

While much has been written on the data deluge in gen- 
omics, biodiversity research has undergone a similar ex- 
plosion in the throughput and volume of data produced. 
With increasingly threatened habitats, free and open ac- 
cess to this data is essential for informed decision-making 
on conservation issues. Much of this growth has been led 
by advances in DNA barcoding, and by combining bulk- 
sampling with genomic technology, the technique of 
metabarcoding will increase this flood of data even further. 
With growing intensities in sampling via mass sampling of 
arthropods, mass detection of environmental DNA in 
aquatic environments, and broad overviews of plant com- 
munities, these sophisticated analyses allow temporal and 
spatial assessment of biodiversity across varied environ- 
ments at previously unobtainable levels of detail. 

These new ecoinformatics and biomonitoring techniques
are able to work quantitatively [1], so in addition to
ecosystem assessment, they also allow biodiversity surveys 
and the discovery of new species, even inside metropolitan 
areas that should be comparatively well sampled [1]. 

Traditional descriptive taxonomy has failed to keep pace 
with the explosive growth of sequencing. As a conse- 
quence there has been a huge increase in the number of 
"dark taxa" within public sequence databases. These are 
taxa that are not identified to a known species, either because
they are new to science, or because the specimen
has never been identified. In many cases dark taxa are
already represented within museum collections and have
published descriptions. However, there is no mechanism
by which taxonomists can easily verify the identity of dark 
taxa, and even if there were, describing them quickly and 
efficiently was impossible until recently, due to the no- 
menclatural rules prohibiting the description of new spe- 
cies in electronic only publications. The increasing pace of 
species extinction, coupled with the decreasing pool of 
taxonomic expertise, means that there is an urgent need 
to speed up the process of investigating biodiversity. 



Potential solutions 

From September 2012 the process of describing animal 
species joined the electronic era, with the acceptance of 
electronic taxonomy publication and registration with 
ZooBank, the official registry of the ICZN (International
Commission on Zoological Nomenclature). The genomic explosion
has led to a rapid increase in the number of reference
genomes, and the production of transcriptomes
is becoming an even faster and more cost-effective alternative
for producing massive amounts of gene sequence
data for genetic and phylogenomic studies. The pace of
traditional taxonomy is, in some instances, catching up 
with genome sequencing, as was demonstrated with a 
new Strepsiptera genome [2] which was published back-to-back
with its species description in ZooKeys [3].
While the barcoding community has produced workarounds
for the lack of species descriptions, such as the
use of interim taxonomic nomenclature (operational
taxonomic units) in their sample registries, the use of
DNA-based classifications was initially restricted to
'taxonomy-free' groups such as bacteria and fungi. The
new Barcode Index Number (BIN) system allows cluster- 
ing of sequences into "BINs", and can aid revisionary 
taxonomy by flagging possible cases of synonymy [4]. 

On top of advances in sequencing technology, new im- 
aging techniques are providing ways to study morphology 
and animal behavior in unprecedented and reproducible 
detail, and in a non-destructive manner. Subrobotic digital 
imaging can rapidly process stacks of images through col- 
lections. Digital video allows for archiving of in-situ behav- 
ior, while the use of X-ray micro-computed tomography 
scanning (microCT) supports three-dimensional virtual 
representations of materials. The use of these data as 
virtual type specimens has been promoted through the 
concept of "cybertypes". These digital representations of 
exemplar specimens create the potential for new forms of 
collections that can be openly accessed and used without 
the physical constraints of loaning specimens or visiting
natural history collections [5]. 

Some have suggested a 'turbo-taxonomy' approach,
combining all of these techniques to address a perceived 
decline in taxonomic expertise [6,7]. This putative pipe- 
line has recently been demonstrated with large series of 
parasitic wasps [6] and Trigonopterus weevils [7]. While 
these examples have focused on taxonomic throughput, 
less attention has been given to the potential to integrate 
these different data types. 

The example 

GigaScience and Pensoft Publishers present the results of 
a pilot study aiming to demonstrate how the classical 
taxonomic description of a new species can be enhanced 
by utilizing the latest molecular methods, and novel com- 
puting and imaging technologies. A new species of cave- 
dwelling centipede, Eupolybothrus cavernicolus Komericki 
& Stoev (Chilopoda: Lithobiomorpha: Lithobiidae) [8], re- 
cently discovered underground in a Croatian cave, is the 
first eukaryotic species description for which, in addition
to traditional morphological description, the authors pro- 
vide a fully sequenced transcriptome, DNA barcodes and 
BIN entries, detailed anatomical X-ray micro-CT scans, as 
well as a movie of the living specimen to document im- 
portant traits of its behavior [9]. 

Communicating the results of next generation sequencing
effectively requires the next generation of data publishing.
The description published in the newly launched
Biodiversity Data Journal (BDJ) aims to provide a gold
standard not just for the quantity and diversity of data
available, but also for the quality and amount of metadata to make
this data reusable and interoperable. It also demonstrates
the benefits of integrating a scholarly publishing workflow 
that allows authors, curators and editors to write, peer- 
review, publish, and disseminate biodiversity data within a 
single web-based platform [10]. GigaScience's contribution
to the pilot is using the GigaDB database for large-scale 
data handling, management, curation and storage (see [9]). 
The data are also available in relevant community specific 
databases, with transcriptomic sequencing data in both 
ENA and ArrayExpress, plus annotation data made publicly
available through ArrayExpress to the most stringent
(MINSEQE) metadata standards. Imaging data is depos- 
ited in morphological databases, and biodiversity data in 
the Barcode of Life databases. All data are made available 
with no restrictions on reuse under the most open CC0
public domain waiver. The publication of Stoev et al. in
this manner provides a significant step forward from integrating
small data sets in the article text in both
computer- and human-readable formats, into the world of
big data publishing.

To tackle complex and novel scientific questions, data- 
sets and metadata from different sources need to be har- 
monized and made interoperable. Working with the ISA 
community we have provided metadata in the interoper- 
able ISA-TAB format to maximize the discovery, exchange 
and informed integration of these diverse datasets. Until 
recently there has been a lack of incentives for data pro- 
ducers to make their data available, but this data note pro- 
vides an example of how credit can be obtained for 
providing this effort. While the focus is on providing data 
rather than analysis, there are interesting questions to be 
asked such as on the evolution of the species, develop- 
ment of its segmented body structure, and how it has 
adapted to its dark cave environment. By providing such a 
diverse range of phenotypic and molecular data in an inte- 
grated and reusable form, we hope to enable other re- 
searchers to explore these and other questions. While this 
new species' subterranean lifestyle could hopefully protect
it from some of the growing threats surface habitats are 
encountering, this new type of species description also 
provides an example of how much previously uncharacter- 
ized information on its behavior, internal structure, physi- 
ology and genetic make-up can be preserved for future 
generations. 